The original formulation of gradient descent just takes a step in parameter space against the gradient, i.e., away from the direction of steepest ascent:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

where $\theta$ are the parameters, $L$ is the error we are minimizing, and $\eta$ is the learning rate.

This strategy forces us to choose a single learning rate, $\eta$. When the learning rate is very small, the process converges slowly and can get stuck at local minima. When the learning rate is large, the updates jump all over the place and may never converge at all. And if some parameters are much more sensitive than others, a learning rate that suits one of them may be a poor fit for another.
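
For concreteness, here is a minimal sketch of that update rule, assuming a NumPy setup; the names (`gradient_descent`, `grad`, `theta0`) and the toy quadratic objective are illustrative choices, not part of the original formulation.

```python
import numpy as np

def gradient_descent(grad, theta0, learning_rate=0.1, n_steps=1000):
    # Repeatedly step against the gradient at a single, fixed learning rate.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy example: minimize L(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
print(gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=[0.0]))  # ~ [3.]
```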

Some things we can do without switching to a new algorithm (sketched in code after the list):

  • Learning rate: We can improve this situation by allowing $\eta$ to decrease over time. This gives us the benefit of fast learning early on, while making it easier to converge as we get close to an optimum.

  • Stochastic gradient noise: We can also add a noise term to the update, to help us avoid getting stuck at local optima.

  • Random restarts: We can run the process from multiple random starting points to increase our chance of finding a global optimum.

  • Early stopping: We can stop the process when the rate of change in the error falls below some threshold.
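
The sketch below combines these tweaks, again assuming a NumPy setup: `noisy_descent` applies a $1/t$ learning-rate decay, adds Gaussian noise to the gradient, and stops early when the error stops changing, while `with_restarts` runs it from several random starting points and keeps the best result. The function names, default values, and the particular decay schedule are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def noisy_descent(loss, grad, theta0, eta0=0.5, decay=0.01,
                  noise_scale=0.1, n_steps=5000, tol=1e-9, rng=None):
    # One run: decaying learning rate, noisy gradient steps, early stopping.
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    prev = loss(theta)
    for t in range(n_steps):
        eta = eta0 / (1.0 + decay * t)                   # learning-rate decay
        noise = noise_scale * rng.standard_normal(theta.shape)
        theta = theta - eta * (grad(theta) + noise)      # noisy gradient step
        cur = loss(theta)
        if abs(prev - cur) < tol:                        # early stopping
            break
        prev = cur
    return theta

def with_restarts(loss, grad, sample_start, n_restarts=10, **kwargs):
    # Random restarts: run from several starting points, keep the best result.
    runs = [noisy_descent(loss, grad, sample_start(), **kwargs)
            for _ in range(n_restarts)]
    return min(runs, key=loss)

# A bumpy 1-D objective with several local minima (purely illustrative).
loss = lambda t: float(t[0]**2 + 2.0 * np.sin(5.0 * t[0]))
grad = lambda t: np.array([2.0 * t[0] + 10.0 * np.cos(5.0 * t[0])])
best = with_restarts(loss, grad,
                     sample_start=lambda: np.random.uniform(-3, 3, size=1))
```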